# Lab 11 - Comparing distributions visually 

This lab will use two days' worth of data from the 311 Service Request dataset on NYC Open Data. 

We will download the 311 data from March 3 and 4, 2019:
1. Go to https://nycopendata.socrata.com/Social-Services/311-Service-Requests-from-2010-to-Present/erm2-nwe9 and click "View Data". 
2. We will filter the data to only contain complaints made on March 3 or 4, 2019: 
    * If necessary, click on "Filter", and then click on "Add new filter condition". 
    * Select the column "Created Date" and change "is" to "is between". 
    * For the first date, select 03/03/2019 12:00:00 AM 
    * For the second date, select 03/05/2019 12:00:00 AM 
    * Check the box to the left of the first date. The data should change, so that only those complaints created on March 3 or 4, 2019 show. 
3. To download the filtered data, click Export, then CSV. 
4. If necessary, rename the file so it will be named differently than the previous 311 data file.
5. Upload your new data file to Jupyter Hub.

As usual, we import the matplotlib and pandas packages and set plots to appear in the Jupyter notebook.

In [None]:
import matplotlib.pyplot as plt
import pandas as pd
%matplotlib inline

Load your new 311 data file into the dataframe `calls`.  Use the `parse_dates` parameter to store the `Created Date` column as a `DatetimeIndex` type instead of a string (see Lab 2 if you forget how to do this).  In other words, this parameter tells Pandas that the `Created Date` column is actually a date/time and not just some random text.  Parsing the date takes some extra time, so we have been skipping this step when we are not using the date.

<details> <summary>Answer:</summary>
    <code>calls = pd.read_csv("../Data/Mar3_4_2019_311_Service_Requests.csv",parse_dates=["Created Date"])</code>
</details>
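To see what `parse_dates` changes, here is a minimal sketch that reads the same text twice, once without and once with the parameter. The three rows in `csv_text` are made up for illustration (they are not from the 311 data), and the dates are written in year-month-day form, which is an assumption and may differ from the format in your downloaded file.

```python
import io
import pandas as pd

# Hypothetical three-row stand-in for the downloaded CSV file.
csv_text = """Unique Key,Created Date,Borough
1,2019-03-03 09:15:00,BROOKLYN
2,2019-03-03 23:40:00,QUEENS
3,2019-03-04 08:05:00,BRONX
"""

# Without parse_dates, the column is read as plain text.
as_text = pd.read_csv(io.StringIO(csv_text))

# With parse_dates, the column is converted to a date/time type.
as_dates = pd.read_csv(io.StringIO(csv_text), parse_dates=["Created Date"])

print(as_text["Created Date"].dtype)
print(as_dates["Created Date"].dtype)

# Only the parsed version supports the .dt accessor used later in this lab.
print(as_dates["Created Date"].dt.day.tolist())
```

Comparing the two printed types is a quick way to confirm that the dates in your own file were parsed.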

Display `calls` to check that it was created correctly. 

## Comparing complaint times on Sunday March 3 vs. Monday March 4

First we are going to compare the times complaints were made on Sunday March 3 with the times complaints were made on Monday March 4.  We will use histograms plotted on top of each other to visualize and compare the two distributions.  Before starting, how do you think timing of complaints will differ between Sunday and Monday?

First we will create a filter to find all complaints created on March 3.  Type `mar3_filter = calls["Created Date"].dt.day ==3` below and run the code.  The `.dt.day` gets just the day from the `DatetimeIndex` type and `==3` checks if it is equal to 3. 

Use your March 3 `mar3_filter` to create a new dataframe called `mar3_calls` containing just the complaints that match this filter.  

<details> <summary>Hint:</summary>
    See Lab 9 or remember that we select from a dataframe using square brackets.  ex. `df["column_name"]` selects the column `column_name` from the dataframe `df`.
</details>

<details> <summary>Answer:</summary>
    <code>mar3_calls = calls[mar3_filter]</code>
</details>
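If it is unclear what the filter contains, the sketch below builds one on a tiny made-up dataframe. The names `toy_calls`, `toy_mar3_filter`, and `toy_mar3_calls` are hypothetical and exist only for this illustration.

```python
import pandas as pd

# Hypothetical stand-in for `calls`; the real data has many more rows and columns.
toy_calls = pd.DataFrame({
    "Created Date": pd.to_datetime([
        "2019-03-03 09:15:00",
        "2019-03-03 23:40:00",
        "2019-03-04 08:05:00",
    ]),
    "Borough": ["BROOKLYN", "QUEENS", "BRONX"],
})

# The filter is a Series of True/False values, one per row.
toy_mar3_filter = toy_calls["Created Date"].dt.day == 3
print(toy_mar3_filter.tolist())   # [True, True, False]

# Selecting with the filter keeps only the rows marked True.
toy_mar3_calls = toy_calls[toy_mar3_filter]
print(len(toy_mar3_calls))        # 2
```

Notice that square brackets select rows when given a True/False Series, and a column when given a column name.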

Now create a dataframe called `mar4_calls` containing only the calls from March 4.  That is, create the filter and then use it to select from the `calls` dataframe.

<details> <summary>Answer:</summary>
    <code>mar4_filter = calls["Created Date"].dt.day ==4
mar4_calls = calls[mar4_filter]</code>
</details>

We will now plot a histogram of the hours of the March 3 calls.  We can get a Series of the hour each March 3 call was created with `mar3_calls["Created Date"].dt.hour`.  Type and run this below to check this.

So to make a histogram of the hours, we can just type `mar3_calls["Created Date"].dt.hour.hist()` below and run it.

What do you notice about the distribution?  You may want to increase the number of bins to 24.

To plot two histograms on the same graph, just put the two pieces of code one after another in the same cell and run it.  Below, replot the histogram of the hours the March 3 complaints were created, along with the histogram of the hours of the March 4 complaints.

<details> <summary>Answer:</summary>
    <code>mar3_calls["Created Date"].dt.hour.hist()
mar4_calls["Created Date"].dt.hour.hist()</code>
</details>

The first histogram (March 3) is in blue and the second (March 4) in orange.  Can you see all of both histograms?  What do you notice?
 
Let's make this plot look nicer.  To make a histogram transparent, add the parameter `alpha = 0.5`.  You can also change the color of a histogram to red by adding the parameter `color = "red"`.  A list of the possible colors is [here](https://matplotlib.org/examples/color/named_colors.html).  

Can you add axis labels and a title?  

Finally, we can add a legend with `plt.legend(["Sunday March 3","Monday March 4"])`.  (This legend assumes the first histogram is for March 3 and the second is for March 4.)

What do you notice?  What differences are there between the distributions of the times of the 311 complaints on Sunday March 3 and Monday March 4?

It looks like there are more complaints on March 4 overall.  So instead of comparing the absolute number of complaints at each time, it might be better to compare the proportion of complaints at each time.  We do this by adding the parameter `density = True` to each histogram function.  Try this below.

<details> <summary>Answer:</summary>
    <code>mar3_calls["Created Date"].dt.hour.hist(bins = 24,alpha = 0.5,density = True)
mar4_calls["Created Date"].dt.hour.hist(bins = 24,alpha = 0.5,density = True)
plt.title("311 complaints by hour")
plt.xlabel("Hour")
plt.ylabel("Proportion of complaints")
plt.legend(["Sunday March 3","Monday March 4"])</code>
</details>
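A small caveat about `density = True`: the bar heights are scaled so that the total *area* of the bars is 1, so they only equal proportions exactly when each bin is 1 wide. The sketch below checks this with `numpy.histogram`, which computes the same kind of heights without drawing anything. The hours in `toy_hours` are made up, and the sketch assumes the data include calls in hour 0 and hour 23, so that 24 equal bins span 0 to 23.

```python
import numpy as np
import pandas as pd

# Hypothetical hours for eight calls; not taken from the 311 data.
toy_hours = pd.Series([0, 9, 9, 10, 14, 14, 14, 23])

# 24 equal-width bins between the smallest and largest hour (0 to 23):
# each bin is 23/24 of an hour wide.
heights_24, edges_24 = np.histogram(toy_hours, bins=24, density=True)

# One bin per hour with edges 0, 1, ..., 24: each bin is exactly 1 wide.
heights_unit, edges_unit = np.histogram(toy_hours, bins=np.arange(25), density=True)

print(heights_unit[14])    # share of the calls made in hour 14 (3 out of 8)
print(heights_unit.sum())  # heights add up to 1 when every bin is 1 wide
print(heights_24.sum())    # a little more than 1, because the bins are narrower
```

With `bins = 24` the heights are therefore slightly larger than the true proportions, by the same factor for both days, so the comparison between Sunday and Monday is unaffected. Passing explicit edges such as `bins = range(25)` to `.hist()` is one way to make the heights exact proportions.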

## Comparing complaint distribution by borough on March 3 vs. March 4

Now we will compare the distribution of complaints by borough on Sunday March 3 vs. Monday March 4.  Do you think the distribution will change?  If so, how?

Recall that we created `mar3_calls` as a dataframe of just the March 3 calls and `mar4_calls` as a dataframe of just the March 4 calls.

First, get the value counts of the `Borough` column in `mar3_calls` and in `mar4_calls` and store these in the variables `mar3_borough_counts` and `mar4_borough_counts` respectively.

<details> <summary>Answer:</summary>
    <code>mar3_borough_counts = mar3_calls["Borough"].value_counts()
mar4_borough_counts = mar4_calls["Borough"].value_counts()</code>
</details>

Just like with the histograms, we can plot two bar charts on top of each other by putting the functions in the same cell.  Try to make an overlapping plot of the two bar charts of the borough counts for March 3 and for March 4 below. 

<details> <summary>Answer:</summary>
    <code>mar3_borough_counts.plot(kind = "bar")
mar4_borough_counts.plot(kind = "bar")</code>
</details>

We can't see the difference between the two plots, so add a parameter to change the color of the March 3 one to blue and the color of the March 4 one to red (or whatever two colors you like).  Also, to further distinguish between the two plots, we can add the parameter `width = 0.75` to the March 3 bar chart and the parameter `width = 0.5` to the March 4 bar chart.

<details> <summary>Answer:</summary>
    <code>mar3_borough_counts.plot(kind = "bar",color = "blue",width = 0.75)
mar4_borough_counts.plot(kind = "bar",color = "red",width = 0.5)</code>
</details>

What do you think the `width` parameter does?  What do you notice about the graph?  Which day has more 311 calls?

Since there are more 311 calls on Monday than on Sunday, it is hard to tell if the distribution between boroughs is different.  We can normalize the counts so that they are proportions instead of counts by dividing by the total number of calls on each day.  For example, to normalize the March 3 counts, we write the following.  Try adding the code to normalize the March 4 counts.

In [None]:
normalized_mar3_borough_counts = mar3_borough_counts/mar3_calls.shape[0]

<details> <summary>Answer:</summary>
    <code>normalized_mar3_borough_counts = mar3_borough_counts/mar3_calls.shape[0]
normalized_mar4_borough_counts = mar4_borough_counts/mar4_calls.shape[0]</code>
</details>
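As an aside, `value_counts` can also do the division itself through its `normalize` parameter. The sketch below compares the two approaches on a made-up Series (`toy_boroughs` is hypothetical). It assumes there are no missing values: `value_counts` skips missing values by default, so with missing boroughs the two results would differ slightly.

```python
import pandas as pd

# Hypothetical borough column with no missing values.
toy_boroughs = pd.Series(["BROOKLYN", "QUEENS", "BROOKLYN", "BRONX"])

# Approach used in this lab: divide the counts by the number of rows.
by_division = toy_boroughs.value_counts() / toy_boroughs.shape[0]

# Alternative: ask value_counts for proportions directly.
by_normalize = toy_boroughs.value_counts(normalize=True)

print(by_division["BROOKLYN"])   # 0.5
print(by_normalize["BROOKLYN"])  # 0.5
```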

Finally, let's plot the normalized counts or proportions:

<details> <summary>Answer:</summary>
    <code>normalized_mar3_borough_counts.plot(kind = "bar",color = "blue",width = 0.75)
normalized_mar4_borough_counts.plot(kind = "bar",color = "red",width = 0.5)</code>
</details>
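One thing worth checking when overlaying bar charts: `value_counts` orders its result from most to least common, and the bars are drawn in that order. If the boroughs rank differently on the two days, a bar from one day can end up on top of a different borough's bar from the other day. The sketch below shows one possible way to guard against this by placing both Series in a single dataframe, which lines them up by borough name. All names and values here are made up for illustration.

```python
import pandas as pd

# Hypothetical proportions; the borough order differs between the two days.
toy_mar3 = pd.Series(["BROOKLYN", "BROOKLYN", "QUEENS"]).value_counts(normalize=True)
toy_mar4 = pd.Series(
    ["QUEENS", "QUEENS", "QUEENS", "BROOKLYN", "BROOKLYN", "BRONX"]
).value_counts(normalize=True)

print(toy_mar3.index.tolist())
print(toy_mar4.index.tolist())

# Building a dataframe aligns the two Series by borough name.
# A borough missing on one day becomes NaN, which we replace with 0.
side_by_side = pd.DataFrame(
    {"Sunday March 3": toy_mar3, "Monday March 4": toy_mar4}
).fillna(0).sort_index()
print(side_by_side)

# Plotting the dataframe draws the two days next to each other for each borough,
# with a legend taken from the column names.
ax = side_by_side.plot(kind="bar")
ax.set_ylabel("Proportion of complaints")
```

A lighter alternative, if you want to keep the overlapping style, is to call `.sort_index()` on both Series before plotting, provided both contain the same boroughs.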

Can you add a title, axis labels, and a legend?

We can make all sorts of interesting comparisons with filters.  Here is one comparing the type of calls made at night (6pm - 6am) with the type of calls made during the day.  Try running the code below.

In [None]:
before_6am_filter = calls["Created Date"].dt.hour < 6
after_6pm_filter = calls["Created Date"].dt.hour >= 18

night_calls = calls[before_6am_filter | after_6pm_filter]
day_calls = calls[~before_6am_filter & ~after_6pm_filter]

night_counts = night_calls["Complaint Type"].value_counts()
night_counts[night_counts >50].plot(kind = "bar",color = "blue",width = 0.75)
day_counts = day_calls["Complaint Type"].value_counts()
day_counts[day_counts > 100].plot(kind = "bar",color = "red",width = 0.5)
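When combining filters with `|` (or), `&` (and), and `~` (not), it helps to test them on data where you already know the answer. The sketch below uses one made-up call per hour and lists which hours land in each group. It assumes that "night" should include the 6pm hour, so it uses `>= 18`; the names are hypothetical.

```python
import pandas as pd

# One hypothetical call at half past every hour of March 3.
toy_times = pd.DataFrame({
    "Created Date": pd.to_datetime(
        [f"2019-03-03 {h:02d}:30:00" for h in range(24)]
    )
})

hour = toy_times["Created Date"].dt.hour
toy_before_6am = hour < 6
toy_from_6pm = hour >= 18

# Night: before 6am OR from 6pm onward.  Day: neither of those.
toy_night = toy_times[toy_before_6am | toy_from_6pm]
toy_day = toy_times[~toy_before_6am & ~toy_from_6pm]

night_hours = toy_night["Created Date"].dt.hour.tolist()
day_hours = toy_day["Created Date"].dt.hour.tolist()
print(night_hours)  # [0, 1, 2, 3, 4, 5, 18, 19, 20, 21, 22, 23]
print(day_hours)    # [6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17]
```

Every hour appears in exactly one of the two lists, which confirms that the two filters split the calls without overlap.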

#### Challenges
- Plot overlapping histograms of the hours that residential noise and commercial noise complaints are made.  Residential noise complaints are listed as `Noise - Residential` under `Complaint Type` and commercial noise complaints are listed as `Noise - Commercial` under `Complaint Type`.
- Plot overlapping bar charts of the borough distribution of no heat/hot water complaints (`HEAT/HOT WATER`) and another complaint of your choice.  Is there a difference in the distributions?
- Your choice!  Pick one variable to compare in two different situations.  You can find a list of the different types of information you can get from the `DatetimeIndex` [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DatetimeIndex.html).
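
As a starting point for the last challenge, here is a minimal sketch of a few other pieces of information available through `.dt`, using two made-up timestamps (`toy_dates` is hypothetical). The same attributes should work on `calls["Created Date"]`.

```python
import pandas as pd

# Two hypothetical timestamps, one on each day covered by this lab.
toy_dates = pd.Series(pd.to_datetime(["2019-03-03 09:15:00", "2019-03-04 18:45:00"]))

print(toy_dates.dt.day_name().tolist())  # ['Sunday', 'Monday']
print(toy_dates.dt.dayofweek.tolist())   # [6, 0]  (Monday is 0, Sunday is 6)
print(toy_dates.dt.minute.tolist())      # [15, 45]
```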